By Ethan Xu, Emily Heiss, and Brandon Quan
The purpose of this page is to guide you through the Data Science process and answer some essential questions about the COVID-19 pandemic.
Around the world, there is widespread "vaccine hesitancy". The CDC defines vaccine-hesitant individuals as those who say they are "unsure", "probably won't", or "definitely won't" receive the vaccine. You can look at the CDC's visualization and article to understand the extent of vaccine hesitancy in the United States, and many of you will be familiar with the status of the vaccine debate there. But what about the rest of the world? Who's right in this debate? How do the vaccines affect the rate of cases and the severity of symptoms? Perhaps the numbers coming from the vaccine producers aren't enough to convince you one way or the other. We believe we can help you take a more informed position through the use of data science.
Before we get into the data science, let's go over some essential technology we'll be working with. This whole project was created in a Jupyter notebook. Jupyter supports many different languages, but we're going to use the one you've probably heard the most about: Python 3.
Python has several libraries that help us immensely. A list has been provided below, but don't feel like you have to read through every link. Understanding every piece of tech isn't essential.
We use more than just the four libraries above, but they are the most essential.
Now onto the science! The first step of the data science process is data collection: we're going to need some data to analyze. At this point, it's important to consider what question we're asking. We'll start off with something simple: how do the most common COVID-19 vaccines affect the number and severity of cases?
In order to answer this question, we're going to grab some data from the WHO (World Health Organization). Click here to see a nice table the WHO has provided on COVID-19 around the world. We're going to download some of the CSVs the WHO used to create this data.
Please note: you can download the CSVs yourself and follow along in your own notebook, or you can follow this link to our Google Drive and run this notebook yourself! Everything's already there: the CSVs are downloaded and all the code is written. We heavily recommend this option.
Let's begin by importing the necessary libraries.
import datetime
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import sys
from sklearn.svm import SVC
from scipy import stats
from scipy.stats import norm
from sklearn import linear_model
from folium.plugins import TimestampedGeoJson
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
Next we're going to get the data out of the CSVs and into a pandas DataFrame.
cases_deaths_df = pd.read_csv('https://raw.githubusercontent.com/EthanXuXu/ethanxuxu.github.io/main/WHO-COVID-19-global-table-data.csv', delimiter=",")
cases_deaths_df.head()
| | Name | WHO Region | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | Deaths - newly reported in last 24 hours |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Global | NaN | 199466211 | 2559.050333 | 4158340 | 53.349393 | 548167 | 4244541 | 54.455309 | 64036 | 0.821549 | 8430 |
| 1 | United States of America | Americas | 35010407 | 10577.080000 | 618994 | 187.010000 | 78722 | 609022 | 183.990000 | 2759 | 0.830000 | 387 |
| 2 | India | South-East Asia | 31769132 | 2302.100000 | 284527 | 20.620000 | 42625 | 425757 | 30.850000 | 3735 | 0.270000 | 562 |
| 3 | Brazil | Americas | 19953501 | 9387.260000 | 245839 | 115.660000 | 15143 | 557223 | 262.150000 | 6721 | 3.160000 | 389 |
| 4 | Russian Federation | Europe | 6356784 | 4355.920000 | 161552 | 110.700000 | 22589 | 161715 | 110.810000 | 5537 | 3.790000 | 790 |
vax_df = pd.read_csv('https://covid19.who.int/who-data/vaccination-data.csv', delimiter=",")
vax_df.head()
| | COUNTRY | ISO3 | WHO_REGION | DATA_SOURCE | DATE_UPDATED | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | FLK | AMRO | OWID | 2021-04-14 | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | AstraZeneca - AZD1222 | NaN | 1.0 |
| 1 | Saint Helena | SHN | AFRO | OWID | 2021-05-05 | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | AstraZeneca - AZD1222 | NaN | 1.0 |
| 2 | Faroe Islands | FRO | EURO | OWID | 2021-11-05 | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | Moderna - mRNA-1273, Pfizer BioNTech - Comirnaty | NaN | 2.0 |
| 3 | Greenland | GRL | EURO | OWID | 2021-12-15 | 78240 | 40385.0 | 137.814 | 71.135 | 37855.0 | 66.679 | Moderna - mRNA-1273 | NaN | 1.0 |
| 4 | Jersey | JEY | EURO | OWID | 2021-12-12 | 196865 | 80602.0 | 182.627 | 74.773 | 76133.0 | 70.627 | Moderna - mRNA-1273, AstraZeneca - AZD1222, Pf... | NaN | 3.0 |
We've just collected a lot of data from the internet, but it's a bit messy. It's missing values. There are variables we don't need. It's hard to read and difficult to work with when we're coding. We've reached the second part of the data science process: cleaning the data.
Let's go ahead and drop the variables we don't need.
# Drop the columns we won't use
vax_df.drop(["ISO3", "WHO_REGION", "DATA_SOURCE", "DATE_UPDATED"], axis=1, inplace=True)
cases_deaths_df.drop(["WHO Region",
                      "Deaths - newly reported in last 24 hours"], axis=1, inplace=True)
We've dropped what we don't need. Let's merge these tables together so all the data is in one place.
# JOIN TABLES BASED ON THE NAME OF THE COUNTRY
cases_deaths_df.rename(columns={"Name":"COUNTRY"}, inplace=True)
merged_df = pd.merge(vax_df, cases_deaths_df, on="COUNTRY")
merged_df.head()
| | COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | VACCINES_USED | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | AstraZeneca - AZD1222 | NaN | 1.0 | 61 | 1751.36 | 1 | 28.71 | 0 | 0 | 0.00 | 0 | 0.00 |
| 1 | Saint Helena | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | AstraZeneca - AZD1222 | NaN | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 |
| 2 | Faroe Islands | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | Moderna - mRNA-1273, Pfizer BioNTech - Comirnaty | NaN | 2.0 | 987 | 2019.85 | 8 | 16.37 | 2 | 2 | 4.09 | 1 | 2.05 |
| 3 | Greenland | 78240 | 40385.0 | 137.814 | 71.135 | 37855.0 | 66.679 | Moderna - mRNA-1273 | NaN | 1.0 | 122 | 214.89 | 13 | 22.90 | 0 | 0 | 0.00 | 0 | 0.00 |
| 4 | Jersey | 196865 | 80602.0 | 182.627 | 74.773 | 76133.0 | 70.627 | Moderna - mRNA-1273, AstraZeneca - AZD1222, Pf... | NaN | 3.0 | 8429 | 7819.40 | 570 | 528.78 | 169 | 70 | 64.94 | 1 | 0.93 |
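One thing worth knowing about pd.merge: by default it performs an inner join, so any country whose name is spelled differently in the two tables is silently dropped. Here's a tiny sketch with made-up rows (the country names and values are illustrative only):

```python
import pandas as pd

# Hypothetical mini-tables: "Iran" is spelled differently in each source
left = pd.DataFrame({"COUNTRY": ["Iran", "France"], "vax": [1, 2]})
right = pd.DataFrame({"COUNTRY": ["Iran (Islamic Republic of)", "France"],
                      "deaths": [10, 20]})

# merge() defaults to how="inner": only exact COUNTRY matches survive
merged = pd.merge(left, right, on="COUNTRY")
print(len(merged))  # 1 -- only "France" matched
```

If keeping every row mattered, an outer join (`how="outer"`) plus some name normalization would be the safer route; for our purposes the inner join is acceptable.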
One of those columns, VACCINES_USED, actually demonstrates a common mistake: it packs several variables into a single column. Let's spread it out and turn it into numbers we can use.
merged_df.dropna(subset=['VACCINES_USED'], inplace=True)
# One-hot encode each vaccine brand: 1 if the country used it, 0 otherwise
vaccine_keywords = {
    "AstraZeneca": "AstraZeneca",
    "Pfizer": "Pfizer",
    "Moderna": "Moderna",
    "Janssen": "Janssen",
    "Sinopharm": "Beijing",  # Sinopharm's shot is listed under "Beijing" in the WHO data
    "Covishield": "Covishield",
}
for column, keyword in vaccine_keywords.items():
    merged_df[column] = merged_df["VACCINES_USED"].str.contains(keyword).astype(int)
merged_df.drop(["VACCINES_USED"], axis=1, inplace=True)
merged_df.head()
| | COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | AstraZeneca | Pfizer | Moderna | Janssen | Sinopharm | Covishield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Falkland Islands (Malvinas) | 4407 | 2632.0 | 126.529 | 75.567 | 1775.0 | 50.962 | NaN | 1.0 | 61 | 1751.36 | 1 | 28.71 | 0 | 0 | 0.00 | 0 | 0.00 | 1 | 0 | 0 | 0 | 0 | 0 |
| 1 | Saint Helena | 7892 | 4361.0 | 129.995 | 71.833 | 3531.0 | 58.162 | NaN | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 1 | 0 | 0 | 0 | 0 | 0 |
| 2 | Faroe Islands | 78347 | 40205.0 | 160.334 | 82.278 | 38142.0 | 78.056 | NaN | 2.0 | 987 | 2019.85 | 8 | 16.37 | 2 | 2 | 4.09 | 1 | 2.05 | 0 | 1 | 1 | 0 | 0 | 0 |
| 3 | Greenland | 78240 | 40385.0 | 137.814 | 71.135 | 37855.0 | 66.679 | NaN | 1.0 | 122 | 214.89 | 13 | 22.90 | 0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 1 | 0 | 0 | 0 |
| 4 | Jersey | 196865 | 80602.0 | 182.627 | 74.773 | 76133.0 | 70.627 | NaN | 3.0 | 8429 | 7819.40 | 570 | 528.78 | 169 | 70 | 64.94 | 1 | 0.93 | 1 | 1 | 1 | 0 | 0 | 0 |
Much nicer! Now we can use the type of vaccine a country administered to help build our model.
Next we have to deal with missing data. This is a problem all data scientists face. We can take the simple route and remove the rows with missing data, or we can go further into imputation. Imputation is the practice of replacing missing data with estimated values based on the rest of the table. The different techniques can get a little complicated, so we won't go over all of them. First, let's get an idea of how much data is missing.
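To see the two simplest options side by side, here's a minimal sketch on a made-up three-row table (not our WHO data):

```python
import numpy as np
import pandas as pd

# A tiny made-up table with one missing value (illustration only)
df = pd.DataFrame({"rate": [10.0, np.nan, 30.0], "country": ["A", "B", "C"]})

# Option 1: drop rows with missing data
dropped = df.dropna()

# Option 2: mean imputation -- fill the gap with the column mean
imputed = df.fillna({"rate": df["rate"].mean()})

print(dropped.shape[0])        # 2 rows remain after dropping
print(imputed.loc[1, "rate"])  # 20.0, the mean of 10 and 30
```

Dropping is safe but throws away rows; mean imputation keeps them at the cost of inventing a value.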
# Calculate how many rows have missing data. It should be a very small amount!
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 3 0 3 3 3 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
There are 222 rows in our dataset. Above you can see how many values are missing in each column. For example, in the first column there are 0 values missing, and in the third column there are 3. The "FIRST_VACCINE_DATE" column is missing 16 values. All in all, not much is missing.
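As an aside, pandas can label those counts for us: calling `isna().sum()` on the whole DataFrame returns a Series indexed by column name, which is easier to read than a bare row of numbers. A sketch on a toy frame:

```python
import numpy as np
import pandas as pd

# Toy frame reusing two of our real column names, with fake values
toy = pd.DataFrame({
    "PERSONS_FULLY_VACCINATED": [100.0, np.nan, 250.0],
    "FIRST_VACCINE_DATE": ["2021-01-04", None, None],
})

# One call gives a labeled count of missing values per column
missing = toy.isna().sum()
print(missing["PERSONS_FULLY_VACCINATED"])  # 1
print(missing["FIRST_VACCINE_DATE"])        # 2
```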
Now, a lot of the nations missing data on when the first vaccine was administered are very small, often isolated countries that aren't that useful to our dataset. Let's go ahead and drop those nations.
merged_df.dropna(subset=['FIRST_VACCINE_DATE'], inplace=True)
merged_df.head()
| | COUNTRY | TOTAL_VACCINATIONS | PERSONS_VACCINATED_1PLUS_DOSE | TOTAL_VACCINATIONS_PER100 | PERSONS_VACCINATED_1PLUS_DOSE_PER100 | PERSONS_FULLY_VACCINATED | PERSONS_FULLY_VACCINATED_PER100 | FIRST_VACCINE_DATE | NUMBER_VACCINES_TYPES_USED | Cases - cumulative total | Cases - cumulative total per 100000 population | Cases - newly reported in last 7 days | Cases - newly reported in last 7 days per 100000 population | Cases - newly reported in last 24 hours | Deaths - cumulative total | Deaths - cumulative total per 100000 population | Deaths - newly reported in last 7 days | Deaths - newly reported in last 7 days per 100000 population | AstraZeneca | Pfizer | Moderna | Janssen | Sinopharm | Covishield |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 9 | Philippines | 90259621 | 56110301.0 | 82.368 | 51.204 | 37353438.0 | 34.087 | 2021-03-01 | 10.0 | 1612541 | 1471.55 | 50352 | 45.95 | 6879 | 28141 | 25.68 | 823 | 0.75 | 1 | 1 | 1 | 1 | 1 | 0 |
| 10 | Nigeria | 12971729 | 8811055.0 | 6.293 | 4.274 | 4153915.0 | 2.015 | 2021-03-05 | 1.0 | 175264 | 85.02 | 3536 | 1.72 | 505 | 2163 | 1.05 | 29 | 0.01 | 0 | 0 | 0 | 0 | 0 | 1 |
| 11 | Oman | 6046310 | 3121257.0 | 118.401 | 61.122 | 2890827.0 | 56.609 | 2020-12-29 | 5.0 | 297431 | 5824.41 | 2414 | 47.27 | 309 | 3877 | 75.92 | 89 | 1.74 | 1 | 1 | 0 | 0 | 0 | 1 |
| 12 | Curaçao | 202135 | 103606.0 | 123.183 | 63.139 | 96166.0 | 58.605 | 2021-02-24 | 3.0 | 13692 | 8344.05 | 392 | 238.89 | 23 | 127 | 77.40 | 1 | 0.61 | 1 | 1 | 1 | 0 | 0 | 0 |
| 13 | Niue | 2352 | 1202.0 | 145.365 | 74.289 | 1150.0 | 71.075 | 2021-06-08 | 1.0 | 0 | 0.00 | 0 | 0.00 | 0 | 0 | 0.00 | 0 | 0.00 | 0 | 1 | 0 | 0 | 0 | 0 |
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 2 0 2 2 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
We still have some missing data. It looks like the missing values are in the columns "PERSONS_VACCINATED_1PLUS_DOSE", "PERSONS_VACCINATED_1PLUS_DOSE_PER100", "PERSONS_FULLY_VACCINATED", and "PERSONS_FULLY_VACCINATED_PER100". We'd like to fill those in, so let's do a hot deck imputation. Hot deck imputation fills in a missing value by copying it from the observation most similar to the one with missing data. Basically, for every row with missing data, we're going to find the most similar row and copy over its values.
How do we measure similarity? Well, it appears we're trying to impute data related to vaccination rates, so to evaluate similarity we'll compare values of total vaccinations per 100 people.
def similar_vax_rate(index):
    # Return the fully populated row whose vaccinations-per-100 value
    # is closest to that of the row at the given index
    most_similar_row = None
    min_difference = sys.float_info.max
    for i, row in merged_df.iterrows():
        if index != i and not row.isnull().values.any():
            new_difference = abs(merged_df.loc[i, "TOTAL_VACCINATIONS_PER100"] - merged_df.loc[index, "TOTAL_VACCINATIONS_PER100"])
            if new_difference < min_difference:
                min_difference = new_difference
                most_similar_row = row
    return most_similar_row
for index, row in merged_df.iterrows():
    if row.isnull().values.any():
        imputation_row = similar_vax_rate(index)
        merged_df.at[index, "PERSONS_VACCINATED_1PLUS_DOSE"] = imputation_row["PERSONS_VACCINATED_1PLUS_DOSE"]
        merged_df.at[index, "PERSONS_VACCINATED_1PLUS_DOSE_PER100"] = imputation_row["PERSONS_VACCINATED_1PLUS_DOSE_PER100"]
        merged_df.at[index, "PERSONS_FULLY_VACCINATED"] = imputation_row["PERSONS_FULLY_VACCINATED"]
        merged_df.at[index, "PERSONS_FULLY_VACCINATED_PER100"] = imputation_row["PERSONS_FULLY_VACCINATED_PER100"]
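For what it's worth, the nearest-neighbor search can also be written without the inner loop using pandas' `idxmin()`. Here's a sketch of the same hot-deck idea on a toy frame with hypothetical column names (not our real DataFrame):

```python
import numpy as np
import pandas as pd

# Toy frame: row 1 is missing its count and has rate 55
toy = pd.DataFrame({
    "rate_per100": [10.0, 55.0, 60.0, 90.0],
    "fully_vaccinated": [1000.0, np.nan, 3000.0, 4000.0],
})

for idx in toy.index[toy["fully_vaccinated"].isna()]:
    complete = toy.dropna()
    # idxmin() returns the index label of the smallest absolute difference,
    # i.e. the donor row with the closest vaccination rate
    donor = (complete["rate_per100"] - toy.at[idx, "rate_per100"]).abs().idxmin()
    toy.at[idx, "fully_vaccinated"] = toy.at[donor, "fully_vaccinated"]

print(toy.at[1, "fully_vaccinated"])  # 3000.0 -- donor has rate 60, closest to 55
```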
Data has been imputed!
for column in merged_df:
    print(merged_df[column].isna().sum(), end=" ")
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Look! No more missing values. Let's start doing some real analysis.
It's time for some "exploratory analysis" and visualizations. We're going to look for patterns in the data, view relationships between variables, analyze the skew, and maybe transform the dataset.
Expect to see a lot of graphs!
We're going to be using Seaborn and matplotlib for much of the graphing.
First, let's start with a simple bar graph. We'll try and get a sense of how the pandemic has affected the world. Let's make a graph documenting the 30 countries with the highest cumulative deaths.
bar_df = merged_df.sort_values(["Deaths - cumulative total"], ascending=False)
bar_df = bar_df.iloc[:30,:]
plt.figure(figsize = (300, 120))
plt.title('Total Deaths per Nation')
sns.set(font_scale=10)
deaths_hist = sns.barplot(x = 'COUNTRY', y = "Deaths - cumulative total", data = bar_df)
deaths_hist.set_xticklabels(deaths_hist.get_xticklabels(), rotation=40, ha="right");
deaths_hist.set(xlabel='Countries', ylabel='Total cumulative deaths');
The countries you see above lost the most. Of course, this list will mostly include nations with larger populations, but COVID-19 is a global problem, and we believe the scale of the tragedy should be considered globally. You can see above that millions of people have died because of the pandemic.
Let's see how the world has fought this tragedy. Let's look at the countries which have vaccinated the most.
bar_df = merged_df.sort_values(["TOTAL_VACCINATIONS"], ascending=False)
bar_df = bar_df.iloc[:30,:]
plt.figure(figsize = (300, 100))
plt.title('Total Vaccinations')
sns.set(font_scale=10)
deaths_hist = sns.barplot(x = 'COUNTRY', y = "TOTAL_VACCINATIONS", data = bar_df)
deaths_hist.set_xticklabels(deaths_hist.get_xticklabels(), rotation=40, ha="right");
deaths_hist.set(xlabel='Countries', ylabel='Total Vaccinations (Billions)');
Above we can see that billions of vaccine doses have been administered around the globe. Clearly, the world has made a great effort. One interesting thing to note is that the countries with the highest cumulative deaths also appear among the countries with the highest total vaccinations.
This data is interesting to observe, but it's time for real analysis. We're going to have to explore relationships between variables in order to create a model.
Let's look at how vaccinations rates correlate with recent cases.
# Scatterplot: fully vaccinated per 100 people vs cases newly reported last 7 days per 100,000 population
X = merged_df["PERSONS_FULLY_VACCINATED_PER100"]
Y = merged_df["Cases - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Cases')
sns.set(font_scale=1)
res = stats.linregress(X, Y)
graph = sns.scatterplot(x = X,y = Y, data=merged_df, s=10)
graph.set(xlabel='Fully Vaccinated per 100', ylabel='Cases in last 7 days per 100,000');
plt.plot(X, res.intercept + res.slope*X)
Hmm, it looks like the correlation between cases and fully vaccinated individuals is positive. This means that the more fully vaccinated individuals there are, the more cases are reported.
This could have to do with population sizes.
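"Appears to be positive" can be put into a number: `stats.linregress` also returns `rvalue`, the Pearson correlation coefficient, whose sign matches the fitted slope. A self-contained sketch on synthetic data (the numbers below are invented, not our WHO table):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=50)         # stand-in for vaccination rates
y = 0.5 * x + rng.normal(0, 5, size=50)  # stand-in for case rates, loosely tied to x

res = stats.linregress(x, y)
# res.rvalue lies between -1 and 1; values near +1 mean a strong positive
# linear relationship, values near 0 mean little linear relationship
print(res.rvalue)
```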
Another thing we are aware of is that the COVID-19 vaccine is not 100% effective at preventing infection. It is still possible to get a breakthrough case.
We do know, however, that the vaccine reduces the risk of hospitalization and death.
Let's take a look at how the vaccine rate and death rates correlate.
# Scatterplot Graph fully vaccinated per 100 people vs deaths newly reported last 7 days
X = merged_df["PERSONS_FULLY_VACCINATED_PER100"]
Y = merged_df["Deaths - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Deaths')
sns.set(font_scale=1)
res = stats.linregress(X,Y)
graph = sns.scatterplot(x = X,y = Y, data=merged_df, s=10)
graph.set(xlabel='Fully Vaccinated per 100', ylabel='Deaths in last 7-days per 100,000');
plt.plot(X, res.intercept + res.slope*X)
The correlation is still positive, although the slope is definitely less steep.
This could be happening for a variety of reasons, such as population density.
Let's see whether using total vaccinations, rather than fully vaccinated persons, changes the picture.
# Scatterplot Graph total vaccinated per 100 people vs cases newly reported last 7 days
X = merged_df["TOTAL_VACCINATIONS_PER100"]
Y = merged_df["Cases - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Cases')
sns.set(font_scale=1)
res = stats.linregress(X,Y)
graph = sns.scatterplot(x = X,y = Y, data=merged_df, s=10)
graph.set(xlabel='Total Vaccinated per 100', ylabel='Cases in last 7-days per 100,000');
plt.plot(X, res.intercept + res.slope*X)
# Scatterplot: total vaccinations per 100 people vs deaths newly reported last 7 days per 100,000 population
X = merged_df["TOTAL_VACCINATIONS_PER100"]
Y = merged_df["Deaths - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Deaths')
sns.set(font_scale=1)
res = stats.linregress(X,Y)
graph = sns.scatterplot(x = X,y = Y, data=merged_df, s=10)
graph.set(xlabel='Total Vaccinated per 100', ylabel='Deaths in last 7-days per 100,000');
plt.plot(X, res.intercept + res.slope*X)
It still appears to be about the same. Let's graph just the 30 countries with the highest death totals to see whether vaccines at least helped alleviate the death rate there.
population_df = merged_df.sort_values(["Deaths - cumulative total"], ascending=False)
population_df = population_df.iloc[:30,:]
X = population_df["TOTAL_VACCINATIONS_PER100"]
Y = population_df["Deaths - newly reported in last 7 days per 100000 population"]
plt.title('Vaccinations Versus Recent Deaths')
sns.set(font_scale=1)
res = stats.linregress(X,Y)
graph = sns.scatterplot(x = X,y = Y, data=population_df, s=10)
graph.set(xlabel='Total Vaccinated per 100', ylabel='Deaths in last 7-days per 100,000');
plt.plot(X, res.intercept + res.slope*X)
Success! Among these countries, the vaccines do appear to be associated with a lower death rate.
At this point we want to use some linear regression to obtain a predictive model of our data. Once our model is built, we can predict values we haven't observed. For instance, we can see how an even HIGHER vaccination rate would look for a country, or, on the other hand, what it would look like if a country were not vaccinated at all.
# Linear Regression
def lin_reg(X, Y):
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3)
    # Make a linear regression model
    model = linear_model.LinearRegression()
    model.fit(X_train, Y_train)
    Y_pred = model.predict(X_test)
    # Print important model details
    print('Ordinary Least Squares (OLS)')
    print('Coefficients: ', model.coef_[0])
    print('Intercept: ', model.intercept_)
    print('Mean squared error: %.2f'
          % mean_squared_error(Y_test, Y_pred))
    print('Coefficient of determination: %.2f'
          % r2_score(Y_test, Y_pred))
    # Return the testing X data and Y predictions created by the model
    return X_test, Y_pred
X = merged_df['PERSONS_FULLY_VACCINATED_PER100'].values.reshape(-1, 1)
Y = merged_df['Cases - newly reported in last 7 days per 100000 population'].values.reshape(-1, 1)
X_test,Y_pred = lin_reg(X, Y)
plt.scatter(X, Y, color='red')
plt.plot(X_test, Y_pred, color='blue')
plt.xlabel("Fully vaccinated persons per 100 people")
plt.ylabel("Newly reported cases in last 7 days per 100k people")
plt.title("Fully vaccinated vs. New Cases")
plt.show()
Ordinary Least Squares (OLS)
Coefficients:  [1.26223978]
Intercept:  [29.16997555]
Mean squared error: 14031.54
Coefficient of determination: 0.05
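The "predict values we haven't observed" idea is just a `model.predict` call on inputs the model never saw during fitting. A minimal sketch with invented numbers (not our fitted COVID model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly
X = np.array([[0.0], [10.0], [20.0], [30.0]])
Y = np.array([1.0, 21.0, 41.0, 61.0])

model = LinearRegression()
model.fit(X, Y)

# Query the model at hypothetical inputs of 0 and 90, e.g. a country
# with no vaccinations versus a very high vaccination rate
predictions = model.predict(np.array([[0.0], [90.0]]))
print(predictions)  # approximately [1.0, 181.0]
```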
Let's see how the model does with deaths rather than cases.
X = merged_df['PERSONS_FULLY_VACCINATED_PER100'].values.reshape(-1, 1)
Y = merged_df["Deaths - newly reported in last 7 days per 100000 population"].values.reshape(-1, 1)
X_test,Y_pred = lin_reg(X, Y)
plt.scatter(X, Y, color='red')
plt.plot(X_test, Y_pred, color='blue')
plt.xlabel("Fully vaccinated persons per 100 people")
plt.ylabel("Newly reported deaths in last 7 days per 100k people")
plt.title("Fully vaccinated vs. New Deaths")
plt.show()
Ordinary Least Squares (OLS)
Coefficients:  [0.00801362]
Intercept:  [0.56313258]
Mean squared error: 3.05
Coefficient of determination: -0.04
Now let's revisit the smaller 30-country dataset to see how the model behaves.
population_df = merged_df.sort_values(["Deaths - cumulative total"], ascending=False)
population_df = population_df.iloc[:30,:]
X = population_df['PERSONS_FULLY_VACCINATED_PER100'].values.reshape(-1, 1)
Y = population_df["Deaths - newly reported in last 7 days per 100000 population"].values.reshape(-1, 1)
X_test,Y_pred = lin_reg(X, Y)
plt.scatter(X, Y, color='red')
plt.plot(X_test, Y_pred, color='blue')
plt.xlabel("Fully vaccinated persons per 100 people")
plt.ylabel("Newly reported deaths in last 7 days per 100k people")
plt.title("Fully vaccinated vs. New Deaths")
plt.show()
Ordinary Least Squares (OLS)
Coefficients:  [-0.00623298]
Intercept:  [1.70965418]
Mean squared error: 9.48
Coefficient of determination: -0.05
After going through the data science process and exploring the correlation between vaccinations and COVID-19 cases and deaths, we were able to see that data analysis can lead to surprising results.
Going into our exploration, we expected a clear negative correlation between vaccination rates and COVID-19 cases and deaths, but that was not the case. Only when we isolated certain countries were we able to observe a negative correlation.
The COVID-19 virus is still novel. With new information and variants emerging each day, it is still tough to make predictions and models of its behavior. When we first began this process, the Omicron variant was not publicly known, and now, as we finish, we are learning that this variant is infecting even those who are vaccinated.
If we were to do this again, we would try to include many more variables, since this is an ever-changing virus. We would add a population-density variable to account for the concentration of vaccinations and deaths that we believe drove the positive correlation. We would also add some measurement of mask wearing, weather, and so on.